Integrated microRNA and proteome analysis of cancer datasets with MoPC

MicroRNAs (miRNAs) are small molecules that play an essential role in regulating gene expression by post-transcriptional gene silencing. Their study is crucial in revealing the fundamental processes underlying pathologies and, in particular, cancer. To date, most studies on miRNA regulation consider the effect of specific miRNAs on specific target mRNAs, providing wet-lab validation. However, few tools have been developed to explain the miRNA-mediated regulation at the protein level. In this paper, the MoPC computational tool is presented, that relies on the partial correlation between mRNAs and proteins conditioned on the miRNA expression to predict miRNA-target interactions in multi-omic datasets. MoPC returns the list of significant miRNA-target interactions and plot the significant correlations on the heatmap in which the miRNAs and targets are ordered by the chromosomal location. The software was applied on three TCGA/CPTAC datasets (breast, glioblastoma, and lung cancer), returning enriched results in three independent targets databases.


Introduction
MicroRNAs (miRNAs) are small RNA molecules that emerged as important regulators of gene expression at the post-transcriptional level.They are involved in many diverse biological processes [1], and their dysregulation may lead to various diseases, including cancer [2][3][4].Accurate prediction of miRNA-target interactions is critical to understanding the function of miRNAs.Although much progress has been made, target identification remains a challenge because of the limited understanding of the molecular basis of miRNA-target coupling, but also due to the context-dependence of post-transcriptional regulation and miRNA mode of action [5][6][7].Generally, miRNAs downregulate proteins through a combination of translational inhibition and promotion of mRNA decay [8][9][10], even though which mechanisms of action of microRNAs are the most dominant remains a matter of debate.The emergence of high-throughput methods in the past decades has allowed researchers to address the question of miRNA action on a global scale.High-throughput experiments have usually been focused on measuring miRNA-mediated changes at the mRNA level (degradation), allowing for the characterization of only a subset of direct targets [11][12][13][14].Further integration of global proteome profiling has allowed the investigation of miRNA-mediated gene expression changes at both the mRNA and the protein level [15,16], even though many of these studies are limited to one or a few miRNAs or proteins.Additionally, few published tools are available to integrate mRNA and protein profiles to infer miRNA regulation.ProteoMirExpress [17] infers active miRNAs based on the anticorrelation between miRNAs and their potential targets at either the mRNA or the protein level, or both.However, pairwise correlation approaches applied to omics profiles produce very dense networks containing direct and indirect associations that are very difficult to distinguish.
Here, we propose a novel in silico method, called MoPC (Multi-omics Partial Correlation analysis), to analyze post-transcriptional regulation by miRNAs by integrating information at both mRNA and protein level.The MoPC method predicts miRNA-target potential interactions using high-throughput mRNA, protein, and miRNA expression datasets from matched samples.It computes the conditional dependence (partial correlation) of mRNA and protein levels, taking into account the effect of a third variable, the expression of a given regulatory miRNA.The underlying hypothesis assumes that if a miRNA regulates a given gene, then the partial correlation between the mRNA and the protein expression of that gene conditioned by the expression of the miRNA will be greater than the bivariate correlation between the mRNA and protein levels of the gene.Therefore, to identify candidate miRNA targets, MoPC selects those genes whose partial correlation between the level of RNA and protein levels conditioned to the expression of a given miRNA is more significant than the bivariate correlation between the RNA and the protein.
MoPC has been applied to TCGA/CPTAC omics datasets of breast, glioblastoma, and lung cancer, predicting key miRNAs involved in these diseases.To assess the reliability of the interactions predicted by the MoPC, they have been tested for the enrichment in predicted targets by three independent databases, namely miRDB [18,19], TargetScan [20], and miRTarBase [21].The targets predicted by MoPC in breast and lung cancer show significant overlap with those predicted by the three independent databases.In contrast, the targets predicted in glioblastoma significantly overlap with targets predicted in TargetScan and miRTarBase.Finally, MoPC provides the user with a heatmap to visualize results and simultaneously explore the conditional correlation map of all genes and miRNAs included in the analysis.The user can visualize the interactions between multiple miRNAs and multiple targets by considering horizontal or vertical bands in the heatmap.

MoPC method
The MoPC workflow is reported in Fig 1 .To predict miRNA-target interactions by integrating information at both mRNA and protein level, MoPC calculates the partial correlation of the mRNA and protein levels of a gene conditioned by miRNA expression.Input requires three matrices of mRNA, protein, and miRNA expression denoted respectively as M, P, and MI with the same number of columns n, corresponding to the samples.The M and P matrices must have the same rows IDs corresponding to the genes for which the mRNA and protein levels were measured.Partial correlation r yx.z between mRNA x and protein expression y while accounting for the effect of a given miRNA z is computed as in Eq 1 (r yx stands for the bivariate correlation between variables y and x): ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 À r 2 yz q ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi The bivariate Pearson correlation r yx between mRNA expression x and protein expression y is also computed.Relevant miRNA-target interactions are selected as those for which a lower p-value is observed with partial correlation than that obtained with bivariate correlation.To ensure a large increase in the significance of the p-value, only interactions with an improvement in the p-value in the upper quartile of the distribution are selected.
The MoPC returns two outputs: 1. miRNA-target significant interactions.This list reports all miRNA-target candidate interactions based on the partial correlation measure.Additional columns contain the validation information (e.g., if the miRNA-target interaction was previously known in the literature according to three databases: miRDB, TargetScan, and miRTarBase).
2. summary heatmap.The heatmap reports genes on the rows and miRNAs on the column ordered according to their chromosomal location.Significant miRNA-targets are reported in red (the more intense the color, the higher the partial correlation estimate).MiRNA-targets both significant and validated in at least one of the three databases (miRDB, TargetScan, and miRTarBase) are reported in dark green.The user can identify miRNAs targeting multiple genes by visually inspecting horizontal or vertical bands in the heatmap.
The method is freely available as an R package at the following link: https://github.com/martalovino/MoPc.
In addition, we investigated the miRNA-targets found by MoPC in all three datasets (breast, lung, and glioblastoma) to search if they contain the seed region of the corresponding miRNA.Since the MoPC method does not exploit at all gene and miRNA sequences in the analysis, the presence of the miRNA seed in the target mRNA is of interest.The gene sequences were obtained by processing the human reference sequence and the annotation files (fasta and gtf format, respectively) from the ENSEMBL GRCh38, version 109, database.Each gene sequence has been selected as the sequence (from the start to the end coordinates) of the longest transcript.The miRNA sequences were obtained from miRDB in fasta format, and the miRNA seed region was identified from base 2 to base 8, counting from the miRNA 5p end.To verify the presence of the seed region in the target gene, we enforced an exact match.
The breast cancer dataset was obtained from the work of Meryins et al., published in 2016 [22].The published matrices contain the quantifications of 17814 genes, 7665 proteins, and 332 miRNAs for 77 samples, respectively.The glioblastoma dataset was obtained from the publication of Wang et al. of 2021 [23], including mRNA, protein, and miRNA expression values for 106 tumor samples.The input matrices contain, respectively, 45914 genes, 10998 proteins, and 2883 miRNAs.The lung squamous cell tumor dataset comes from the 2021 publication of Satpathy et al., in which mRNA, proteins, and miRNA expression are quantified [24].More than 200 samples have been analyzed, of which 201 have all three omics quantifications of our interest.Overall, 21792 genes, 11111 proteins, and 2585 miRNAs are reported.
The three datasets were processed according to the following procedure.These datasets contain missing values, particularly in protein and miRNA expression matrices.For our purposes, proteins and miRNAs with less than 10% of missing values across samples were kept.The remaining missing values were imputed by imposing the median value.Finally, the mRNA e miRNA matrices were transformed by applying the log2 function.
In all datasets, only the top 50% most highly variable proteins were included, since MoPC requires a certain variance in the expression profiles.
Table 1 shows the datasets size supplied to MoPC.

MoPC detects biologically relevant regulatory miRNAs in cancer datasets
MoPC was applied to three omic datasets from TCGA/CPTAC projects, namely breast, glioblastoma, and lung cancer.MoPC output returns 1) a table of miRNAs and predicted target genes and 2) a heatmap to visualize the chromosomal position of miRNAs and target genes.In particular, the table listing miRNAs and target genes contains both the estimate value and pvalue returned from the partial correlation (the higher the estimate and the lower the FDR adjusted p-value, the more significant the interaction between that gene and that miRNA).In addition, it returns if the gene-miRNA target interaction is already annotated in the three external databases (miRDB, TargetScan, and miRTarBase).

Computational validation
In order to test the reliability of MoPC predictions, the enrichment in miRDB, TargetScan, and miRTarBase predicted targets was assessed for each of the three cancer datasets, using hypergeometric test.In addition, the same enrichment test was performed for miRNA-target predictions based on the bivariate correlation between mRNA and miRNA expression and between protein and miRNA expression.Pearson correlation with FDR corrected p-values was computed and the miRNA-target pairs with a p-value lower than alpha 0.05 were selected.
The same conditions of the partial correlation were used to compute the bivariate correlations to perform a fair comparison.Tables 2-4 report for each of the three cancer datasets the percentage of miRNA-targets selected as significant, and the p-value of the hypergeometric test in each of the three databases (miRDB, TargetScan, and miRTarBase) both for the MoPC method and the mRNA-miRNA and protein-miRNA bivariate correlation.

MiRNA-targets visualization map
MoPC returns a heatmap where miRNAs and predicted targets are arranged according to their chromosomal position.The heatmap represents in red the miRNA-targets significantly predicted according to the MoPC partial correlation analysis and in dark green, the miRNAtargets statistically predicted and validated in at least one of the three databases (miRDB, Tar-getScan, and miRTarBase).The user can visually explore the presence of miRNAs that regulate multiple genes and vice-versa (the same gene regulated by multiple miRNAs).S1-S3 Figs show breast, glioblastoma, and lung cancer heatmaps, respectively.

Discussion
The goal of MoPC is to predict relevant miRNA-targets integrating information from miRNA, mRNA and proteome profiles.The predicted targets are validated with three reference databases (miRDB, TargetScan, and miRTarBase), both for the MoPC method and for the bivariate correlation between mRNA-miRNA and protein-miRNA.As reported in Tables 2-4, MoPC predictions are enriched in independent database targets for the three cancer datasets analyzed.Except for the enrichment of glioblastoma in miRDB, all enrichments are statistically significant.The miRNA-target pairs predicted from the bivariate mRNA-miRNA and protein-miRNA correlation are also statistically enriched for each cancer dataset and database, unless for breast cancer in the bivariate protein-miRNA correlation in the mirDB database.An essential aspect to compare is the total number of miRNA-target interactions predicted by MoPC and bivariate mRNA-miRNA and protein-miRNA correlation.Analyses of glioblastoma and lung cancer datasets benefit from the results obtained through MoPC.Indeed, MoPC returns 17.20%, 18.26%, and 18.49% of miRNA-target interactions for breast, glioblastoma, and lung cancer.Although the bivariate mRNA-miRNA and protein-miRNA Table 2. Validation of results on breast cancer dataset.The first two columns contain the results related to the bivariate correlation between mRNA-miRNA and protein-miRNA.The last column contains the MoPC results.Pearson correlation with FDR corrected p-values was computed and the miRNA-target pairs with a p-value lower than alpha 0.05 were selected.The same conditions of the partial correlation were used to compute the bivariate correlations to perform a fair comparison.correlation are generally more enriched in the three external target databases, they return a higher percentage of miRNA-target correlated pairs, typically greater than 36% for both mRNA-miRNA and protein-miRNA.This aspect is of uttermost importance, as the number of investigable gene-miRNA pairs is not clinically useful if it is close to 30-60% of all gene-miRNA pairs in the input dataset.MoPC proves to be an effective method in identifying key miRNAs within a dataset, reducing the number of targets while maintaining the statistical significance of the enrichment.

Bivariate correlation mRNA-miRNA
After the MoPC computational validation in the three external databases, we focused on specific miRNAs known in the literature to be associated with specific diseases.Indeed, the enrichment in miRDB, TargetScan, and miRTarBase verifies that the miRNA-target interaction is known or predicted in the literature.However, such enrichment does not consider whether the predicted miRNAs are key in the specific disease (breast, glioblastoma, lung cancer).Therefore we compared MoPC miRNA-target pairs with disease-associated miRNA in the literature [25][26][27] (see  for validation information in breast, gliobastoma and lung cancer, respectively. Regarding breast cancer, MoPC detects the main miRNAs involved in the cancer genesis, including miR-126-5p and miR-140-5p (with anti-angiogenesis and anti-tumorigenesis function) and miR-200b-5p and miR-200c-5p (promoting metastasis and invasion).Furthermore, among the miRNAs known in glioblastoma, MoPC successfully identifies the tumor suppressors miR-7-5p and miR-128-5p and the oncomiRs miR-21-5p and miR-93-5p, responsible for survival, proliferation, apoptosis, and drug resistance.For lung cancer, the miRNAs known and identified by MoPC are miR-7-5p and miR-21-5p (chemosensitive and chemoresistant, respectively) and miR-200c-5p and miR-223-5p with EGFR-TKI and ALK-TKI sensitivity.The Tables below show the top 5 percentile of miRNAs with the highest significant number of targets found by MoPC and their associated disease.
Table 8 shows for each dataset the percentage of miRNA-targets in which the seed was found in any region of the gene, the percentage in which the seed region is located at 5p and 3p of the gene, and the percentage in which the seed region is located in a coding region (CDS region).The last row shows the percentage in which the seed region is either on the 5p UTR, Table 4. Validation of results on lung cancer dataset.The first two columns contain the results related to the bivariate correlation between mRNA-miRNA and protein-miRNA.The last column contains the MoPC results.Pearson correlation with FDR corrected p-values was computed and the miRNA-target pairs with a p-value lower than alpha 0.05 were selected.The same conditions of the partial correlation were used to compute the bivariate correlations to perform a fair comparison.3p UTR, or CDS (the seed could be found in multiple locations, so the last row could not be the mathematical sum of the three previous ones).

Conclusion
Identifying key miRNAs is of fundamental importance for the early diagnosis and treatment of many diseases, including cancer.However, in the literature, various tools deal with providing a list of relevant miRNAs without exploiting the information coming from the expression of proteins.This paper presents MoPC, a novel method for predicting key miRNA-target pairs by exploiting the computation of the partial correlation between mRNA and protein expression conditioned by miRNA expression.The method is applied to three TCGA/CPTAC datasets: breast, glioblastoma, and lung cancer.The results show enrichment for each dataset in the three databases, miRDB, TargetScan, and miRTarBase, except for breast cancer in miRDB.Furthermore, MoPC successfully identifies key miRNAs in each cancer type.Finally, the user can inspect the results both through the list of significant miRNA-gene pairs and graphically through the generated heatmap, in which genes and miRNAs are chromosomally ordered.In the future, this approach can be extended to other types of omics data, given that the underlying biological phenomenon can be detected through partial correlation.hsa-let-7a-5p LIN28 Chemo sensitive [27] hsa-miR-26a-5p hsa-miR-183-5p hsa-let-7f-5p LIN28 Chemo sensitive [27] hsa-miR-30a-3p PI3K-SIAH2 EGFR TKI-resistant [27] hsa-miR-125a-5p hsa-miR-205-5p PTEN, Mcl-1 and Survivin Chemo resistant [27] hsa-miR-126-3p Akt and ERK pathways Chemo sensitiveEGFR TKI-sensitive [27] hsa-miR-7-5p EGFR Chemo sensitive [27] hsa-miR-9-3p https://doi.org/10.1371/journal.pone.0289699.t007

Fig 1 .
Fig 1. MoPC method in a glance.Input data consist of M, P, and MI matrices representing mRNA, proteomics, and miRNA expression matrices.First, the partial correlation matrix is computed.Note that this matrix has R m rows corresponding to the number of rows in M and P (genes) and R mi columns equal to the number of MI rows (miRNAs).Next, the improvement matrix is computed as the difference between the partial correlation p-value and the bivariate correlation p-value.Finally, relevant gene-miRNA interactions are chosen as the top quartile of positive improvement values.https://doi.org/10.1371/journal.pone.0289699.g001

Table 1 . Data dimensions as provided in input to MoPC.
For each dataset, the number of samples, quantified mRNA, proteins and miRNAs are reported after the adeguate preprocessing.

Table 3 .
Validation of results on glioblastoma dataset.The first two columns contain the results related to the bivariate correlation between mRNA-miRNA and protein-miRNA.The last column contains the MoPC results.Pearson correlation with FDR corrected p-values was computed and the miRNA-target pairs with a p-value lower than alpha 0.05 were selected.The same conditions of the partial correlation were used to compute the bivariate correlations to perform a fair comparison.

Table 8 . Presence of miRNA seed in miRNA-targets selected by MoPC.
Each cell contains the percentage of miRNA-targets detected by MoPC for the specific tissue (Breast, Lung, and Glioblastoma) in which the target mRNA contains the miRNA seed region (anywhere in the gene, in 5pUTR, 3pUTR, coding region).The last row is not the sum of the previous rows since the miRNA seed can be present in multiple gene locations. https://doi.org/10.1371/journal.pone.0289699.t008